My goal with this analysis is to find out whether Brazilian deputies have been misusing their reimbursement rights.

These reimbursements are called Quota for the Exercise of Parliamentary Activity and are “designated to pay for expenses exclusively linked to the exercise of parliamentary activity”. Therefore, as far as I could tell, there are two main ways we could detect an improper reimbursement claim (given the available data):

* If the refund category is suspicious, or
* If the time component of the refund is suspicious

Dissemination of Parliamentary Activity

To investigate suspicious refund categories, I plotted a bar chart of total refunds per category for the top 7 categories. Much to my surprise, this first plot already showed something very odd.

# Plot total value of refunds
description_summary %>%
  filter(
    row_number() > 10
  ) %>%
  plot_ly(
    x = ~refund_description,
    y = ~refund_tot,
    type = "bar",
    color = ~refund_description,
    colors = "Accent",
    opacity = 0.70
  ) %>%
  layout(
    legend = list(
      orientation = 'h',
      traceorder = "reversed",
      x = 0.5, y = -100
    ),
    yaxis = list(
      title = ""
    ),
    xaxis = list(
      title = "",
      showticklabels = FALSE
    )
  )
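
The plot above reads from a `description_summary` table whose construction isn't shown. A minimal sketch of how such a table could be built with dplyr follows, assuming it holds one row per category with a `refund_tot` column and is sorted ascending (so that a filter on the last rows keeps the largest totals). The toy data and the `_toy` names are made up for illustration and are not the real dataset.

```r
library(dplyr)

# Toy stand-in for the `deputies` data; values are invented for illustration.
deputies_toy <- data.frame(
  refund_description = c("Fuel", "Fuel", "Telephone", "Dissemination", "Dissemination"),
  refund_value = c(100, 250, 80, 5000, 12000),
  stringsAsFactors = FALSE
)

# Total refunded per category, sorted ascending so the categories with the
# largest totals end up in the last rows of the table.
description_summary_toy <- deputies_toy %>%
  group_by(refund_description) %>%
  summarise(refund_tot = sum(refund_value, na.rm = TRUE), .groups = "drop") %>%
  arrange(refund_tot)

print(description_summary_toy)
```

With a table shaped like this, keeping only the last rows selects the categories with the highest totals.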

Apparently “dissemination of parliamentary activity” is the category that has had the highest overall cost for the taxpayer: a total of R$48,645,429.54. Since this refund description is very vague, it seems to me that deputies are widely using it as a cover for improper refunds.

Just to make sure I wasn’t being too quick to judge, I decided to look into this a little further and created a box-plot for the 7 categories with the highest mean refund values.

# Plot refund descriptions
deputies %>%
  filter(
    refund_description %in% d$refund_description[11:17]
  ) %>%
  plot_ly(
    x = ~refund_description,
    y = ~refund_value,
    type = "box",
    color = ~refund_description,
    colors = "Paired"
  ) %>%
  layout(
    legend = list(
      orientation = 'h',
      traceorder = "reversed",
      x = 0.5, y = -100
    ),
    yaxis = list(
      title = ""
    ),
    xaxis = list(
      showticklabels = FALSE,
      title = "",
      mirror = TRUE
    )
  )

In the image above, dissemination of parliamentary activity doesn’t have the highest median, but its outliers stand out from the rest. If we examine the top outlier of this category (and of the whole plot), we find that it corresponds to R$184,500.00 being reimbursed for expenses at a small print shop, which supports the hypothesis that this category is in fact being misused by the deputies.
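
Looking up the top outlier of a category can be sketched as below. This is only an illustration on made-up rows: the `company_name` column and all values are assumptions, not fields or figures taken from the real data.

```r
library(dplyr)

# Toy claims; `company_name` is a hypothetical column and the values are invented.
deputies_toy <- data.frame(
  refund_description = c("Dissemination", "Dissemination", "Fuel", "Fuel"),
  company_name = c("Print Shop A", "Agency B", "Gas Station C", "Gas Station D"),
  refund_value = c(90000, 3200, 150, 90),
  stringsAsFactors = FALSE
)

# Keep one category, then take the single claim with the largest value.
top_claim <- deputies_toy %>%
  filter(refund_description == "Dissemination") %>%
  slice_max(refund_value, n = 1)

print(top_claim)
```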

The next logical step would be to do some text mining on the company names to identify their types, and then check whether those types match the assigned refund categories. For the sake of brevity, I won’t attempt to do this.
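
As a rough starting point for that idea, here is a very small keyword-based sketch. It is not real text mining: the keywords, the `company_name` column, the company names and the mapping from company type to expected category are all hypothetical and would need to be adapted to the actual data.

```r
library(dplyr)

# Toy claims; names, categories and column names are invented for illustration.
claims_toy <- data.frame(
  company_name = c("Grafica Rapida Ltda", "Posto Bom Preco", "Hotel Central", "XYZ Servicos"),
  refund_description = c("Dissemination", "Fuel", "Dissemination", "Telephone"),
  stringsAsFactors = FALSE
)

flagged <- claims_toy %>%
  mutate(
    name_lower = tolower(company_name),
    # Guess a company type from keywords in its name (hypothetical keywords).
    company_type = case_when(
      grepl("grafica|print", name_lower) ~ "print shop",
      grepl("posto|gas", name_lower) ~ "gas station",
      grepl("hotel", name_lower) ~ "hotel",
      TRUE ~ "unknown"
    ),
    # Category we would expect for each type (hypothetical mapping).
    expected_description = case_when(
      company_type == "print shop" ~ "Dissemination",
      company_type == "gas station" ~ "Fuel",
      company_type == "hotel" ~ "Lodging",
      TRUE ~ NA_character_
    ),
    # Flag claims whose category differs from the expected one;
    # companies of unknown type are never flagged.
    mismatch = !is.na(expected_description) &
      expected_description != refund_description
  )

print(flagged %>% select(company_name, company_type, refund_description, mismatch))
```

A flagged row is only a hint that a claim deserves a closer look, since keyword matching will misclassify many company names.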

Conclusion

This wasn’t in any way an exhaustive report, but I think it already provides enough evidence to support the idea that Brazilian deputies are using public money for personal gain. In the beginning we set out to investigate two aspects of the dataset, and were able to find suspicious activity in both. There are probably many other ways to look at this data that I didn’t think of or didn’t have time for, so I highly encourage you to analyse it for yourself!

If you’d like to know more about this subject, I suggest Serenata de Amor, an initiative created by Brazilian data scientists that uses AI to flag suspicious reimbursement claims. I also have a blog where I regularly post my data visualizations and analyses, so if you liked this one there is a good chance you’ll like the rest!